Discrimination of Escherichia coli, Shigella flexneri, and Shigella sonnei using lipid profiling by MALDI‐TOF mass spectrometry paired with machine learning

Abstract Matrix‐assisted laser desorption/ionization‐time of flight mass spectrometry (MALDI‐TOF MS) has become a staple in clinical microbiology laboratories. Protein‐profiling of bacteria using this technique has accelerated the identification of pathogens in diagnostic workflows. Recently, lipid profiling has emerged as a way to complement bacterial identification where protein‐based methods fail to provide accurate results. This study aimed to address the challenge of rapid discrimination between Escherichia coli and Shigella spp. using MALDI‐TOF MS in the negative ion mode for lipid profiling coupled with machine learning. Both E. coli and Shigella species are closely related; they share high sequence homology, reported for 16S rRNA gene sequence similarities between E. coli and Shigella spp. exceeding 99%, and a similar protein expression pattern but are epidemiologically distinct. A bacterial collection of 45 E. coli, 48 Shigella flexneri, and 62 Shigella sonnei clinical isolates were submitted to lipid profiling in negative ion mode using the MALDI Biotyper Sirius® system after treatment with mild‐acid hydrolysis (acetic acid 1% v/v for 15 min at 98°C). Spectra were then analyzed using our in‐house machine learning algorithm and top‐ranked features used for the discrimination of the bacterial species. Here, as a proof‐of‐concept, we showed that lipid profiling might have the potential to differentiate E. coli from Shigella species using the analysis of the top five ranked features obtained by MALDI‐TOF MS in the negative ion mode of the MALDI Biotyper Sirius® system. Based on this new approach, MALDI‐TOF MS analysis of lipids might help pave the way toward these goals.


| INTRODUCTION
There are four commonly recognized Shigella species (Shigella boydii, Shigella dysenteriae, Shigella flexneri, and Shigella sonnei), all of which may cause the well-characterized disease known as shigellosis (Niyogi, 2005). In contrast, Escherichia coli strains in the human gut are typically commensal, although some pathovars can cause diarrhea. Shigellosis is endemic throughout the world and is responsible for nearly 165 million cases of severe dysentery each year (Kotloff et al., 1999;Niyogi, 2005). Since shigellosis is highly communicable (<100 viable cells can produce disease in healthy adults), it is a serious health concern at childcare centers and in developing countries with poor sanitation conditions. For example, in the United States, approximately 14,000 cases of shigellosis occur each year, with S. flexneri and S. sonnei identified as the predominant pathogens (Khalil et al., 2018). The Shiga-toxin-producing species S. dysenteriae, although infrequently isolated in the United States, may produce a more-serious disease that can be fatal if left untreated.
Shigella species and E. coli are very closely related Gram-negative bacteria belonging to the Enterobacterales. Phenotypically, Shigella species and E. coli share many common characteristics; genotypically, they could be considered the same species but present different infectiousness and clinical outcomes (Halimeh et al., 2021;Kaper et al., 2004;Pupo et al., 2000;van den Beld et al., 2019). Due to this close relatedness, the differentiation of Shigella species from E. coli can be difficult and time-consuming. Nowadays, the diagnosis of shigellosis is based on the isolation of the pathogen from stool culture on conventional screening media, biochemical assays, and molecular detection such as 16S rRNA sequencing and/or amplification of the invasion plasmid antigen H (ipaH) (de Boer et al., 2010;Schaumburg et al., 2021;Van Lint et al., 2016;Vu et al., 2004;Zimmermann et al., 2020). The antibiotic treatment has to be implemented only if a true pathogen (i.e., Shigella) is identified, not if only commensal E. coli are isolated. Unfortunately, both Shigella and E. coli can grow on screening media. Currently, methods based on biochemical tests and serotyping are preferred for the discrimination of these species. However, these approaches may have suboptimal diagnostic performance as they are slow, relying on multiple-step methods of culturing on selective agar, slide agglutination tests, and the use of commercial biochemical identification kits. In addition, Shigella species and E. coli are undistinguishable using molecular methods such as sequencing the 16S rRNA gene or molecular syndromic panel (usually a detection of the invasin gene inv i.e. common to Shigella spp. and enteroinvasive E. coli isolates) (Schaumburg et al., 2021;Zimmermann et al., 2020) as well as routine matrix-assisted laser desorption/ionization-time of flight mass spectrometry (MALDI-TOF MS) (van den Beld et al., 2022).
Indeed, protein-based MALDI-TOF MS, which is now the gold standard for bacterial identification in clinical microbiology laboratories, is unable to provide accurate differentiation of Shigella spp. and E. coli, likely due to their similar protein profiles (Devanga Ragupathi et al., 2018). Despite few studies reporting the possibility to discriminate Shigella spp. from E. coli using classical MALDI-TOF MS with a specific reference library (Paauw et al., 2015) or algorithm for peaks interpretation (Khot & Fisher, 2013), these methods have never been implemented in routine testing (van den Beld et al., 2022).
Accordingly, definitive discrimination between E. coli and Shigella spp. still relies on biochemical characters assessed in an additional 24 h using a biochemical gallery (e.g., API20E strip, Vitek ® 2 GN identification card). Then, serotyping can be performed to definitively discriminate between the four species: S. boydii, S. dysenteriae, S. flexneri, and S. sonnei ( Figure A1). Despite literature showing that some lipids such as lipid A and LPS composition in Shigella species display some differences like numbers of acylation or presence of phosphoethanolamine groups (Casabuono et al., 2012), a lipid-based MALDI-TOF MS method had not yet been attempted as a rapid diagnostic tool.
This study aims to accelerate the application of routine MALDI-TOF MS to address the public health challenge of the rapid discrimination between Shigella spp. and E. coli. To do so, we have explored the use of lipid profiling to discriminate between the closely related species of E. coli and the most prevalent Shigella species (S. sonnei and S. flexneri), for which speed of diagnostics is crucial to treat the patient and prevent and control outbreaks. Analysis using MALDI-TOF MS in the negative-ion mode combined with a machine learning algorithm demonstrated that it is possible to tell apart E. coli, S. flexneri, and S. sonnei using lipid profiles. Accordingly, if a positive signal was obtained by multiplex PCR for enteroinvasive E. coli/Shigella spp., a Salmonella/Shigella agar (bioMérieux, la Balme les Grottes, France) and a Hecktoen agar (bioMérieux) were used for culture. On colonies that cultured after 24 h incubation, identification was performed using an API20E biochemical strip (bioMérieux) or Vitek ® 2 GN identification card (bioMérieux), allowing discrimination between E. coli and Shigella spp.

| Bacterial strains
Then, serotyping was performed to definitively discriminate between the four species: S. boydii, S. dysenteriae, S. flexneri, and S. sonnei. In the routine workflow, if the API20E biochemical strip or the Vitek2 identified a sample as E. coli, multiplex PCR (syndromic molecular panel) might be performed on the bacterial colony to verify if this E. coli isolate corresponds to an enteroinvasive strain (acquisition of the Shigella invasin gene inv).

| Sample preparation for lipid profiling
One bacterial colony was resuspended in 100 µL of water. This bacterial suspension was centrifuged at ×1000g for 10 min, then

| MALDI-TOF MS analysis
The spectra were recorded in the linear negative-ion mode (laser intensity 95%, ion source 1 = 10.00 kV, ion source 2 = 8.98 kV, lens = 3.00 kV, detector voltage = 2652 V, pulsed ion extraction = 150 ns) using MALDI Biotyper Sirius ® system (Bruker Daltonics). Each spectrum corresponded to an ion accumulation of 5000 laser shots randomly distributed on the spot for the range m/z 1000 to m/z 2500. The spectra obtained were processed with default parameters using FlexAnalysis v.3.4 software (Bruker Daltonics).

| Pre-processing of lipid spectra data
The bioinformatics analysis pipeline used R version 4.1.2. The method described here used code adapted from a study by Gibb & Strimmer (Gibb & Strimmer, 2015). "MALDIquant" (version 1.21) and "MAL-DIquantForeign" (version 0.13) packages were used to pre-process the spectra data for all E. coli, S. sonnei, and S. flexneri samples. First, a square root transformation (sqrt) was performed on the intensities of the spectra. The intensity values were then smoothed using the Savitzky-Golay method (Steinier et al., 1972). The baseline of the mass spectrometry data was estimated and then removed using the statistics-sensitive nonlinear iterative peak-clipping (SNIP) algorithm. Intensity values were normalized using the total ion current method then spectra were aligned. A signal-to-noise ratio of 3 (SNR = 3) and a half window size of 20 (HWS = 20) were used to detect peaks above the defined threshold in the mass spectrometry data. Following this, the peak binning function was used to look for similar peaks across different spectra and equalize their mass. Finally, peaks that occurred infrequently within the same species group were removed from the data. After this pre-processing, the result was a two-dimensional feature matrix containing peak intensity information for the spectra of all samples.

| Machine learning
After the above pre-processing workflow, the feature matrix was converted into both a naïve binary absence-presence matrix (replaced non-negative and missing value with 1 and 0 respectively in feature matrix, true labels were not utilized) and a dichotomized binary matrix (For each feature (m/z), a threshold is determined by considering true labels. Intensities above that threshold will be set to 1, otherwise 0, via R packages "binda" [version 1.0.4] [Gibb & Strimmer, 2015]). Hierarchical clustering was applied to the naïve binary feature matrix to figure out if different species can be separated in an unsupervised manner.
Binary discriminant analysis was then applied to the dichotomized binary feature matrix to identify and rank the most differentially expressed peaks across the spectra and ascertain whether any of these peaks from the lipid profiles could be used for cla184-ss prediction, i.e., to determine whether the spectra PIZZATO ET AL. | 3 of 14 belonged to a sample of E. coli, S. sonnei, or S. flexneri (by computing t-scores between the group means (in each species) and the pooled mean (across species). The top-ranked peaks were used to test their class prediction ability. Data were further split into training and testing data (randomly picked 70% of all samples for training and the rest for testing) to study the robustness of the top-ranked features in terms of classification of the three species.  (Casabuono et al., 2012;Krokowski et al., 2018;Lindberg et al., 1991;Paciello et al., 2013). The major peak at m/z 1796.2 corresponds to hexa-acyl diphosphoryl lipid A containing four 3-OH-C14:0 acyl groups, one C14:0 acyl group, and one C12:0 acyl group referred to as native lipid A (Casabuono et al., 2012;Lindberg et al., 1991;Paciello et al., 2013). In S. sonnei 3.2 | Machine learning allows discrimination of E. coli, S. sonnei, and S. flexneri clinical isolates Following the first workflow for pre-processing the data (Tang et al., 2019) (see details in pre-processing of lipid spectra data section above), the lipid profiles already led to the clustering of three different entire lipid profiles in an unsupervised manner (Figure 2).
Then, we identified top-ranked peaks and validated their robustness in this classification.
The dichotomized matrix was used for extracting top-ranked peaks (Table A1) based on multi-class discriminant analysis using binary predictors in a supervised manner (Gibb & Strimmer, 2015). The intensities of the 15 top-ranked peaks reported from "binda" efficiently distinguished E. coli, S. sonnei, and S. flexneri

| Validation of the robustness of top-ranked peaks via supervised learning and using randomly selected features as controls
To validate the robustness of the top-ranked peaks in terms of classification of the three bacterial species, data were further separated into training and testing data (70% out of 155 samples for training). The random separation of training and testing data was repeated 100 times. Various numbers of top-ranked peaks including all peaks (89 in total) and one set of randomly selected 15 peaks were used as controls. Consistent with a previous report from Tang et al. (2019) only minor differences could be found using either top-ranked peaks or whole peaks with respect to accuracy rates. Just a subset of top-ranked peaks would be enough for this classification problem (Conrad et al., 2017;Gibb & Strimmer, 2015). Using top-ranked peaks, we were able to achieve relatively higher performance in terms of four metrics precision, sensitivity, specificity, and F1 score for identifying E. coli and Shigella spp. In addition, when simply looking at accuracy rates based on the comparison between E. coli and Shigella spp. (S. flexneri + S. sonnei, around 0.9, Figure 4) or between S. flexneri and S. sonnei (around 0.87, Figure 5), it enables discrimination of E. coli from the other two species, which is consistent with the data shown in Figure 6. Overall, the five topranked peaks are sufficient for discriminating the different species (m/z 1929.7, m/z 1783.3, m/z 1570.3, m/z 1941.9, m/z 1950.3) (Table A1). whether lipid typing can help to discriminate between the closely related species E. coli and Shigella spp.

MALDI-TOF-MS is a valuable tool already in use in
Other methods have been attempted to accurately discriminate E. coli from Shigella, including protein-based MALDI-TOF MS and molecular methods, but due to their close relationship, these assays are not reliable. Currently, identification methods that are routinely used in clinical microbiology laboratories rely on a biochemical characterization using an API20E strip that induces 24 h delay to discriminate E. coli from Shigella spp. and on subsequent serotyping for Shigella (Devanga Ragupathi et al., 2018). Serotyping is based on the O-antigen expressed on the microbial surface (Allison & Verma, 2000;Gentle et al., 2016;Liu et al., 2021;Sun et al., 2011Sun et al., , 2012. This O-antigen polysaccharide is the outermost portion of LPS, and its variability among Gram-negative bacteria allows many pathogens F I G U R E 3 Top 15 ranked features reported from "binda" (ranked from left to right, from top to bottom). p-values come from the t-test.
F I G U R E 4 Bar plots showing accuracy values for species prediction (between Escherichia coli and combination of Shigella flexneri and Shigella sonnei) using top-ranked features; all features and randomly selected features. The accuracy values come from the analysis being repeated 100 times of splitting training and testing data and random selection of peaks for control.

F I G U R E 5
Barplots showing accuracy values for species prediction (between Shigella flexneri and Shigella sonnei) using top-ranked features; all features and randomly selected features. The accuracy values come from the analysis being repeated 100 times by splitting training and testing data and random selection of peaks for control.
F I G U R E 6 Classification metrics including precision, sensitivity, specificity, and F1 PIZZATO ET AL.
to be classified into serotypes. Serotyping done by agglutination tests is labor-intensive and errors due to serological cross-reactivity might occur (Muthuirulandi Sethuvel et al., 2017). Lipopolysaccharide, used for serotyping, is composed of the lipid A, inner core, outer core, and O-antigen. There is a large similarity between the O-antigen of E. coli and Shigella causing such cross-reactivity. However, in the approach described here, we treat the bacteria with acetic acid 1% v/v for 15 min, which leads to the enrichment of lipids including lipid A, as that treatment cleaves the LPS at the inner core. The remaining products of the hydrolysis are then washed and directly deposited onto the MALDI target plate. This is then overlaid with the matrix which contains apolar solvent (e.g., chloroform), favoring the ontarget extraction and ionization of apolar molecules such as phospholipids and glycolipids. We cannot rule out that the peaks we observed are degradation products of larger molecules allowing the actual discrimination between Shigella and E. coli despite the high level of similarity between their O-antigens.
In this context, our results suggested that MALDI-TOF MS profiling of lipids could provide a straightforward alternative method of species discrimination using the existing routine MALDI mass spectrometer system compared to serotyping which could lead to cross-reactivity and error in the interpretation of the bacterial strain identification. Indeed, MALDI-TOF MS is a technique that uses inexpensive reagents and produces robust and reproducible results in a short timeframe. Lipid profiling of bacterial species is an emerging field of research that has already been developed for the rapid identification of antimicrobial resistance traits such as polymyxin resistance using the MALDI-TOF negative mode (Dortet, Bonin, et al., 2018;Dortet et al., 2019Dortet et al., , 2020Dortet, Tande, et al., 2018;Furniss et al., 2019;Jeannot et al., 2021;Potron et al., 2019). To include this new field of investigation, a specific module has been added to the routine MALDI-TOF MS of Bruker, the MALDI Biotyper Sirius ® system. Accordingly, the implementation of this negative mode paved the way for the development of new methods for bacterial identification.
We should acknowledge that more research is needed before this method of discrimination between E. coli, S. sonnei, and S. flexneri becomes a gold standard in clinical microbiology laboratories. Indeed, to continue this study, it will be necessary to accumulate data on clinical isolates of S. dysenteriae and S. boydii that are less prevalent than S. sonnei and S. flexneri. For a comprehensive analysis, all four serogroups must be represented to validate lipid profiling as a reliable method of bacterial species identification. The bioinformatics analysis pipeline should be repeated with the new data set to determine whether the peaks with the best predictive ability to discriminate between E. coli, S. sonnei, and S. flexneri also work when S. boydii and S. dysenteriae are added to the list. In addition, not all m/z peaks observed in this study have a confirmed molecule assignment. Future work might also be performed to characterize these newly identified molecules. Of note, another limitation resides in the fact that not all mass spectrometers dedicated to routine microbiology laboratories possess the negative ion mode. However, a module dedicated to lipid profiling has been recently marketed by Bruker on its MALDI Biotyper Sirius ® system. However, this study is another proof-of-concept demonstrating that the lipid profiling performed on a routine MALDI-TOF MS machine with negative ion mode might be a reliable tool for the rapid identification of relevant pathogens. This method can address the needs of clinical microbiology diagnostics that are not met by other assays and help to determine effective treatment options.

CONFLICT OF INTEREST
None declared.

DATA AVAILABILITY STATEMENT
All data are provided in full in the results section of this paper.

ETHICS STATEMENT
None required.